Runbook: Loki Ingestion Failure
Alert
- Prometheus Alerts: LokiIngestionRateDrop, LokiRequestErrors, LokiStreamLimitExceeded
- Grafana Dashboard: Loki dashboard
- Firing condition: Loki ingestion rate drops by more than 50% compared to the previous hour, or Loki returns 429/500 errors to Alloy collectors
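The firing condition above can be expressed as a Prometheus alerting rule. The following is an illustrative sketch only -- the group name, `for` duration, and label values are assumptions, not the cluster's actual rule definition:

```yaml
# Sketch of a PrometheusRule body matching the firing condition.
# Compares the current 5m ingestion rate against the same window one hour ago.
groups:
  - name: loki-ingestion
    rules:
      - alert: LokiIngestionRateDrop
        expr: |
          sum(rate(loki_distributor_bytes_received_total[5m]))
            < 0.5 * sum(rate(loki_distributor_bytes_received_total[5m] offset 1h))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Loki ingestion rate dropped by more than 50% vs. the previous hour"
```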
Severity
Warning -- Log ingestion failure means platform and application logs are being dropped. This creates gaps in observability and audit trails, violating NIST AU-2 (Audit Events) and AU-4 (Audit Storage Capacity) controls.
Impact
- New logs from pods and node journals are not being stored
- Grafana log queries return incomplete results
- Audit trail gaps for compliance (NIST AU-2, AU-12)
- Alert rules based on log queries (LogQL) will not fire
- Incident investigation is impaired due to missing log data
Investigation Steps
- Check Loki pod status:
kubectl get pods -n logging -l app.kubernetes.io/name=loki
- Check Loki logs for errors:
kubectl logs -n logging -l app.kubernetes.io/name=loki --tail=200
- Check Alloy (log collector) pod status:
kubectl get pods -n logging -l app.kubernetes.io/name=alloy
- Check Alloy logs for delivery errors:
kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=200 | grep -i "error\|429\|500\|dropped\|failed"
- Check Loki metrics for ingestion rate:
Prometheus query: rate(loki_distributor_bytes_received_total[5m])
- Check for rate limiting:
kubectl logs -n logging -l app.kubernetes.io/name=loki --tail=200 | grep -i "rate\|limit\|429"
- Check Loki storage (filesystem in SingleBinary mode):
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- df -h /tmp/loki
- Check the HelmRelease status:
flux get helmrelease loki -n logging
flux get helmrelease alloy -n logging
- Check if Loki can be reached from Alloy:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=alloy -o name | head -1) -- wget -q -O- http://loki.logging.svc.cluster.local:3100/ready
- Check Loki readiness:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- wget -q -O- http://localhost:3100/ready
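The checks above can be bundled into a small triage helper. A sketch, assuming kubectl is already pointed at the affected cluster; the function names are ours, not part of any existing tooling:

```shell
#!/usr/bin/env bash
# Sketch: one-shot triage for the investigation steps above.
# Namespace and label selectors mirror the runbook commands.
set -uo pipefail

NS=logging
LOKI_SEL="app.kubernetes.io/name=loki"
ALLOY_SEL="app.kubernetes.io/name=alloy"

pod_status() {      # pod status for a label selector
  kubectl get pods -n "$NS" -l "$1" -o wide
}

recent_errors() {   # recent error-like log lines for a selector
  kubectl logs -n "$NS" -l "$1" --tail=200 \
    | grep -iE "error|429|500|dropped|failed" || true
}

loki_ready() {      # hit /ready from inside the first Loki pod
  local pod
  pod=$(kubectl get pod -n "$NS" -l "$LOKI_SEL" -o name | head -1)
  kubectl exec -n "$NS" "$pod" -- wget -q -O- http://localhost:3100/ready
}

# Only touch the cluster when explicitly asked: ./triage.sh run
if [ "${1:-}" = "run" ]; then
  pod_status "$LOKI_SEL"
  pod_status "$ALLOY_SEL"
  recent_errors "$ALLOY_SEL"
  loki_ready
fi
```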
Resolution
Loki storage full
- Check current disk usage:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- du -sh /tmp/loki/chunks /tmp/loki/rules
- If the storage is full, reduce the retention period temporarily:
# In the Loki HelmRelease values
loki:
  limits_config:
    retention_period: "168h" # Reduce from 720h to 168h (7 days)
- Push the change to Git and reconcile:
flux reconcile helmrelease loki -n logging --with-source
- If using PVC storage, expand the PVC (if the StorageClass supports expansion):
kubectl edit pvc storage-loki-0 -n logging
- For immediate relief, restart Loki to trigger compaction:
kubectl rollout restart statefulset loki -n logging
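To decide quickly whether the volume has crossed the critical threshold, the `df` output can be parsed directly. A sketch using `df -P` (POSIX format keeps the columns stable); the 85% threshold matches the critical alert level in the Prevention section:

```shell
# Sketch: extract the Use% figure from `df -P <path>` output on stdin
# and compare it against the critical threshold.
disk_use_pct() {
  awk 'NR==2 { gsub("%","",$5); print $5 }'
}

storage_critical() {  # exit 0 if usage >= 85%
  [ "$1" -ge 85 ]
}
```

Usage: `kubectl exec -n logging <loki-pod> -- df -P /tmp/loki | disk_use_pct`, then feed the number to `storage_critical` to gate the retention-reduction step.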
Loki rate limiting (429 errors)
- Check the current rate limits:
kubectl get helmrelease loki -n logging -o yaml | grep -A 10 "limits_config"
- Increase the ingestion rate limits:
loki:
  limits_config:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20
    per_stream_rate_limit: "5MB"
    per_stream_rate_limit_burst: "15MB"
- Alternatively, reduce log volume by filtering noisy namespaces in the Alloy configuration
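As an illustration of the filtering alternative, a hedged sketch of an Alloy `loki.process` component that drops logs from a noisy namespace. The component names and the `dev-sandbox` namespace are placeholders; verify the stage syntax against the Alloy documentation for the deployed version:

```river
// Sketch only: drop all log lines from the (hypothetical) dev-sandbox
// namespace before forwarding to Loki.
loki.process "drop_noise" {
  forward_to = [loki.write.default.receiver]

  stage.match {
    selector = "{namespace=\"dev-sandbox\"}"
    action   = "drop"
  }
}
```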
Alloy pods not collecting logs
- Check Alloy DaemonSet:
kubectl get daemonset alloy -n logging
- Verify all nodes have an Alloy pod:
kubectl get pods -n logging -l app.kubernetes.io/name=alloy -o wide
- If a pod is missing, check node taints:
kubectl describe node <node-name> | grep Taints
- Restart the Alloy DaemonSet:
kubectl rollout restart daemonset alloy -n logging
Alloy cannot reach Loki
- Test connectivity:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=alloy -o name | head -1) -- wget -q -O- http://loki.logging.svc.cluster.local:3100/ready
- Check NetworkPolicies in the logging namespace:
kubectl get networkpolicies -n logging
- Verify the Loki service exists and has endpoints:
kubectl get svc loki -n logging
kubectl get endpoints loki -n logging
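If a NetworkPolicy is blocking the collectors, an explicit allow rule for Alloy-to-Loki traffic can be added. A sketch only -- the label selectors mirror the ones used elsewhere in this runbook, but verify them against the actual chart-generated labels before applying:

```yaml
# Sketch: admit ingress from Alloy pods to Loki's HTTP port (3100).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alloy-to-loki
  namespace: logging
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: alloy
      ports:
        - protocol: TCP
          port: 3100
```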
Loki pod crash-looping
- Check logs from the previous crash:
kubectl logs -n logging -l app.kubernetes.io/name=loki --previous --tail=100
- Common causes:
- Out of memory (check resource limits)
- Corrupted WAL (write-ahead log)
- Schema migration failure after upgrade
- If WAL corruption is suspected:
# Delete the WAL directory (data since last flush will be lost)
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- rm -rf /tmp/loki/wal
kubectl rollout restart statefulset loki -n logging
- If out of memory, increase limits in the HelmRelease
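For the out-of-memory case, a sketch of the HelmRelease values change. The exact key path depends on the chart deployment mode (`singleBinary.resources` shown here, matching the SingleBinary mode referenced above), and the sizes are placeholders:

```yaml
# Sketch: raising memory for Loki in SingleBinary mode.
# 1Gi/2Gi are placeholders -- size from observed usage, not from here.
singleBinary:
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 2Gi
```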
High cardinality labels causing memory pressure
- Check for high cardinality (the active stream count is a good proxy):
Prometheus query: sum(loki_ingester_memory_streams)
- Identify namespaces producing the most log volume:
LogQL: sum by (namespace) (bytes_rate({namespace=~".+"}[5m]))
- Add label drop rules in the Alloy configuration for high-cardinality labels that are not useful
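A hedged sketch of such a drop rule in Alloy, using `loki.relabel`. The `pod_template_hash` label is only an example of a typical high-cardinality offender -- identify the real ones with the queries above first, and check the rule syntax against the Alloy documentation:

```river
// Sketch only: strip an example high-cardinality label before logs
// reach Loki. Component names are illustrative.
loki.relabel "strip_high_cardinality" {
  forward_to = [loki.write.default.receiver]

  rule {
    action = "labeldrop"
    regex  = "pod_template_hash"
  }
}
```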
Emergency: log data gap recovery
If logs were lost during the outage:
- Alloy pods do not buffer logs to disk by default, so logs generated during the outage are lost
- For critical audit events, check the Kubernetes API audit log on the server node:
ssh sre-admin@<server-ip> "sudo cat /var/lib/rancher/rke2/server/logs/audit.log | tail -500"
- Check node journals for events during the gap:
ssh sre-admin@<node-ip> "sudo journalctl --since '<start-time>' --until '<end-time>'"
Prevention
- Monitor the loki_ingester_chunks_stored_total and loki_distributor_bytes_received_total metrics
- Alert on ingestion rate drops greater than 50% over 15 minutes
- Alert on Loki storage usage at 70% (warning) and 85% (critical)
- Set appropriate retention periods based on compliance requirements (currently 720h / 30 days)
- Review and optimize Alloy relabeling rules to drop unnecessary labels
- Ensure Loki has sufficient memory for the expected log volume
- Configure Alloy to drop debug-level logs from noisy namespaces in non-production environments
Escalation
- If log ingestion is completely stopped: this is a compliance violation (NIST AU-2, AU-12) -- escalate to platform team lead
- If Loki data is corrupted and unrecoverable: restore from Velero backup of the logging namespace
- If the issue is caused by excessive log volume from a tenant application: notify the tenant team to reduce log verbosity